parser: account for number of header columns in dialect detection #1359

psFried · 2024-01-30T19:05:45Z

When we parse CSVs, we consider it an error for any row to contain more values than there are header columns. But dialect detection wasn't consistent with that behavior: if it encountered such a row, it would score it higher than a row containing fewer values than there are headers. As a consequence, we could end up scoring an incorrect quote character higher than a correct one if it produced more columns (which is often the case when quoted values contain delimiters). This commit addresses that oversight by zeroing the score of any row that contains too many values, so that it is treated the same as a row that couldn't be parsed at all. The result is that dialect detection produces a much more accurate guess of the correct quote character.
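To illustrate the scoring rule described above, here is a minimal sketch. It is not the parser's actual implementation: `count_values`, `score_row`, and `score_dialect` are hypothetical names, the quote handling is deliberately naive (no escaped quotes), and the fractional score for short rows is an assumption made only so that short rows rank above zeroed ones.

```rust
/// Hypothetical helper: counts the values in one line for a candidate
/// delimiter and quote character. Naive on purpose: a quote character
/// simply toggles "inside quotes", and escapes are not handled.
fn count_values(line: &str, delimiter: char, quote: char) -> usize {
    let mut in_quotes = false;
    let mut count = 1;
    for c in line.chars() {
        if c == quote {
            in_quotes = !in_quotes;
        } else if c == delimiter && !in_quotes {
            count += 1;
        }
    }
    count
}

/// Hypothetical row score. A row with more values than headers scores
/// zero, the same as a row that could not be parsed. The fractional
/// score for shorter rows is an assumption of this sketch.
fn score_row(num_values: usize, num_headers: usize) -> f64 {
    if num_headers == 0 || num_values > num_headers {
        return 0.0;
    }
    num_values as f64 / num_headers as f64
}

/// Hypothetical dialect score: treats the first line as the header and
/// sums the scores of the remaining rows.
fn score_dialect(lines: &[&str], delimiter: char, quote: char) -> f64 {
    match lines.split_first() {
        None => 0.0,
        Some((header, rows)) => {
            let num_headers = count_values(header, delimiter, quote);
            rows.iter()
                .map(|row| score_row(count_values(row, delimiter, quote), num_headers))
                .sum()
        }
    }
}
```

For a header `id,name,note` and a row `1,"Smith, John",ok`, the candidate quote `"` yields three values, while the incorrect candidate `'` yields four. Under this sketch the second candidate scores zero for that row, so it can no longer outrank the correct quote character.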


psFried · 2024-01-30T19:07:17Z

crates/parser/src/format/character_separated/detection.rs

@@ -106,9 +119,14 @@ pub fn detect_dialect(
        .pop()
        .expect("must have at least one candidate dialect");
    // Log the top few candidates, as it's helpful to see the runner up when detection doesn't go as we expected
-    let runners_up = &dialects[0..(dialects.len().min(3))];
+    let runners_up = &dialects[dialects.len().saturating_sub(3)..];


The previous expression here was incorrect: it took the first three entries of `dialects` rather than the last three. This now correctly prints the top 3 runners-up, with the last value being the 2nd-place finisher.
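The difference between the two slice expressions can be seen in a small sketch. It assumes the candidates are sorted in ascending order of score and that the winner has already been removed with `pop()`, which is what the comment above implies. The function names `first_three` and `last_three` are hypothetical and exist only for this comparison.

```rust
/// The old expression: the first (up to) three elements, which in an
/// ascending sort are the lowest-scoring candidates.
fn first_three<T>(items: &[T]) -> &[T] {
    &items[0..items.len().min(3)]
}

/// The new expression: the last (up to) three elements, which in an
/// ascending sort are the highest-scoring remaining candidates.
/// `saturating_sub` clamps the start index to 0 for short slices.
fn last_three<T>(items: &[T]) -> &[T] {
    &items[items.len().saturating_sub(3)..]
}
```

With remaining scores `[1, 2, 3, 4, 5]`, the old expression selects `[1, 2, 3]` and the new one selects `[3, 4, 5]`, whose last element is the second-place finisher. Both expressions are safe on slices shorter than three elements.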

williamhbaker

LGTM

psFried requested a review from williamhbaker January 30, 2024 19:05


williamhbaker approved these changes Jan 30, 2024


psFried merged commit 9b390c5 into master Jan 30, 2024
3 checks passed

psFried deleted the phil/csv-detect-better branch January 30, 2024 20:07

psFried commented Jan 30, 2024 · edited by jgraettinger
